In this lab, we will continue the steps of a geostatistical analysis, using the set of precipitation data for Switzerland, and the data set of soil samples from a floodplain near the Meuse river in the Netherlands.

Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab11. The following files will be used in this lab, all available on Canvas:

  • swiss_ppt.zip. This contains a set of point observations of rainfall in Switzerland
  • meuse.zip. This contains the shapefiles for the Meuse soil sample data set as well as the river borders
  • ne_50m_admin_0_countries.zip. This contains a shapefile of country outlines
  • oregon.zip: Oregon temperature dataset

You should have these files from last week’s lab, but if not you will need to download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously. Make sure to unzip the zip files so that R can access the content. Note that on Windows, you will need to right-click on the file and select ‘Extract files’.

Now start RStudio and change the working directory to lab11. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.

You will also need several packages to carry out all the steps listed here: sf, stars, RColorBrewer and gstat. Again, you should have these from last weeks lab, but if not make sure these are installed these before starting.

pkgs <- c("gstat",
          "stars",
          "RColorBrewer",
          "sf")

install.packages(pkgs)
library(ggplot2)
library(gstat)
library(RColorBrewer)
library(sf)
library(stars)
library(viridis)

With all the examples given, it is important to not just type in the code, but to try changing the parameters and re-running the functions multiple times to get an idea of their effect. Help on the parameters can be obtained by typing help(functionname) or ?functionname.

Setting projections for gstat

By default gstat assumes that your data are in a Cartesian projection, unless the Spatial* object containing the data has projection metadata specifying that it is on a spherical (lat/lon) coordinate system. Without this, longitude and latitude coordinates will be treated as Cartesian, and distances will be incorrectly calculated for variogram analysis and kriging. We can illustrate this with the Oregon dataset. Load this and check the associated projection (this should read ‘NA’):

oregon = st_read("../datafiles/oregon/oregontann.shp", quiet = TRUE)

st_crs(oregon)
## Coordinate Reference System: NA

Let’s make a quick plot of the temperature data:

# names(oregon)

plot(oregon["tann"], pch = 16)

This shows a fairly clear pattern with warmer temperatures at lower elevations and closer to the coast. We’ll now make a variogram for to explore this pattern (first load the gstat package):

or_vgm <- variogram(tann ~ 1, oregon)

plot(or_vgm)

The distance lags are calculated in degrees. To correct this, we can either project the data or simply specify that these data are spherical coordinates. In general, the second of these is a better approach, unless you are working in a very small area, as all projections will distort distances and/or directions over large areas. Here, we set the st_crs to EPSG 4326, which is the WGS84 standard:

st_crs(oregon) <- 4326

Now if we remake the variogram, gstat recognizes this as spherical coordinates, and calculates distances as great circle distances in km, rather than Euclidean degrees, reducing the bias in distances.

or_vgm <- variogram(tann ~ 1, oregon)

plot(or_vgm)

Reading the data

This code is taken from last week’s lab and will load and convert all the necessary files for the Swiss precipitation data:

## Precipitation data
swiss <- read.csv("../datafiles/swiss_ppt/swiss_ppt.csv")
swiss.sf <- st_as_sf(swiss, 
                     coords = c("x", "y"),
                     crs = 2056)

## Elevation grid
swiss.dem <- read_stars("../datafiles/swiss_ppt//swiss_dem.asc")
st_crs(swiss.dem) <- 2056

## Swiss border
countries <- st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp", quiet = TRUE)
swiss.bord <- subset(countries, NAME == "Switzerland")
swiss.bord <- st_transform(swiss.bord, 2056)

Make a simple plot of precipitation amounts:

plot(swiss.sf["ppt"], reset = FALSE, pch = 16)
plot(st_geometry(swiss.bord), add = TRUE)

Alternatively, we can use the ggplot2 library:

ggplot() + 
  geom_sf(data = swiss.bord) +
  geom_sf(data = swiss.sf, aes(col = ppt), size = 2.5) +
  scale_color_viridis_c(direction = -1) +
  theme_bw()

Now load the Meuse datasets:

meuse <- st_read("../datafiles/meuse/meuse.shp", quiet = TRUE)
st_crs(meuse) <- 28992

meuse.riv <- st_read("../datafiles/meuse/meuseriv.shp", quiet = TRUE)
st_crs(meuse.riv) <- 28992

meuse.grid <- st_read("../datafiles/meuse/meusegrid.shp", quiet = TRUE)
st_crs(meuse.grid) <- 28992
meuse.grid <- st_rasterize(meuse.grid["dist"], dx = 40, dy = 40)

A histogram of the Swiss precipitation data shows that it is right-skewed. Before carrying out any analysis, we’ll log transform it to normalize the distribution. Note that as there are several sites that have zero precipitation we also add a small positive value to these sites to make the log transformation possible.

hist(swiss.sf$ppt)

swiss.sf$lppt <- log(swiss.sf$ppt + 1e-1)

Kriging with external drift

Kriging can be extended to include covariates to improve prediction. One of the most flexible methods is kriging with an external drift, where the drift refers to any covariate that has been recorded both at the sampling locations and at the prediction locations. A scatterplot of the precipitation against the elevation of each site shows a weak negative relationship, with generally higher values at lower elevations:

plot(lppt ~ elev, swiss.sf)

We will use this model together with elevation data from the DEM to perform kriging with an external drift on the precipitation data. As we are incorporating a covariate in our model, we first need to remake the precipitation variogram, so that the variogram takes this into account:

ppt.var <- variogram(lppt ~ elev, swiss.sf)

plot(ppt.var, plot.numbers = TRUE)

Now let’s fit a model as we did before:

modNugget <- 0.05
modRange <- 100000
modSill <- 0.75

ppt.vgm1 <- vgm(psill = modSill, 
                "Sph", 
                range = modRange, 
                nugget = modNugget)

ppt.vgm2 <- fit.variogram(ppt.var, ppt.vgm1)

plot(ppt.var, ppt.vgm2, main = "Swiss precip. variogram")

We will now use this variogram to interpolate the precipitation data. We use the same function as in the previous lab (krige()), but now specify ‘elev’ as an independent variable in the model formula. This requires that both the spatial points (swiss.sf) and the new locations (swiss.dem) have a variable called `elev’, so let’s check this first:

## Check to see that both the data and grid have elev variables
names(swiss.sf)
## [1] "id"       "ppt"      "elev"     "geometry" "lppt"
names(swiss.dem)
## [1] "swiss_dem.asc"

This shows that the swiss.dem variable was named using the original file name. let’s rename this

names(swiss.dem) <- "elev"

Now we can go ahead and run the model:

## kriging with external drift
ppt.pred.ked <- krige(lppt ~ elev, 
                      swiss.sf, 
                      swiss.dem, 
                      ppt.vgm2)
## [using universal kriging]

Plot the new predictions:

# names(ppt.pred.ked)

my.pal = brewer.pal(9, "Blues")
plot(ppt.pred.ked["var1.pred"], 
     col = my.pal, 
     main = "Swiss log precipitation (KED)")

Note that we can also back-transform to precipitation in mm by using exp():

ppt.pred.ked$ppt <- exp(ppt.pred.ked$var1.pred)

plot(ppt.pred.ked["ppt"], 
     col = my.pal, 
     main = "Swiss precipitation (KED)")

As before, we can estimate the prediction skill of the model using cross-validation:

ppt.cv.ked <- krige.cv(lppt ~ elev, 
                       swiss.sf, 
                       ppt.vgm2, 
                       nmax = 40, 
                       nfold = 5)
## RMSE
sqrt(mean(ppt.cv.ked$residual^2))
## [1] 0.4114167
## R2
cor(ppt.cv.ked$observed, ppt.cv.ked$var1.pred)^2
## [1] 0.7960134

Regression kriging

Regression kriging provides a more flexible approach to including covariates than universal kriging or external drift kriging, but requires a little more work. The idea is that rather than trying to incorporate a potentially complicated relationship in the model, this is modeled separately, then the residuals are interpolated using simple kriging. A final estimate at each new location can then be made by adding the predicted value from the original model to the interpolated residual. This opens up the possibility of using any regression technique with the covariate(s) to model the larger, structural trends, and then using simple kriging to model the deviations from this trend.

As a simple example, we will build a linear model of precipitation using elevation as the covariate, and use the residuals from this as the basis for kriging. Note that as we have specified the covariate in the regression model, we no longer need to include it in the variogram or the krige() function.

fit1 <- lm(lppt ~ elev, swiss.sf)

swiss.sf$resid <- residuals(fit1)

resid.var <- variogram(resid ~ 1, swiss.sf)

plot(resid.var)

Now fit a variogram model to this:

resid.vgm <- vgm(0.75, "Cir", 100000, 0.05)

plot(resid.var, resid.vgm)

We now interpolate the residuals (I haven’t fit the variogram model to the sample data, but feel free to do so). Note that as these are residuals, the mean is assumed known (\(=0\)), so we use simple kriging for the interpolation. The parameter beta is used to set the value of the mean:

resid.sk <- krige(resid ~ 1, 
                  swiss.sf, 
                  swiss.dem, 
                  resid.vgm, 
                  nmax = 40, 
                  beta = 0)
## [using simple kriging]
my.pal <- brewer.pal(9, "PRGn")

plot(resid.sk["var1.pred"], 
     col = my.pal, 
     main = "Swiss precipitation residuals (RK)")

To make the final estimates, we first predict the precipitation using the simple linear model and the elevation values on the DEM, and store this in the stars DEM object:

swiss.dem$ppt.lm <- predict(fit1, swiss.dem)

Now we add the interpolated residuals to these values:

swiss.dem$ppt.rk <- swiss.dem$ppt.lm + resid.sk$var1.pred

And finally visualize the predictions:

my.pal <- brewer.pal(9, "Blues")

plot(swiss.dem["ppt.rk"], 
     col = my.pal, 
     main = "Swiss precipitation (Regression Kriging)")

We can, of course, back-transform the data to the non-log scale using exp(). While the results here are not that much different to the kriging with an external drift listed above, the method is much more flexible. We could replace the linear model with generalized linear models, additive models, mixed-effects models and even machine learning methods.

Indicator Kriging

Indicator kriging is used to interpolate binary variables as probabilities. It can be used to estimate whether the variable of interest will be over or below a given threshold at a new location, or the probability that a new location will have a binary or categorical variable. In either case, the method consists quite simply of interpolating binary values (0/1) using ordinary kriging. As this uses the same function as before, you can include a trend or external drift if necessary.

Thresholds

We’ll first use this method to find all locations in Switzerland with over 40mm of rainfall. Here, we create a new variable in the swiss.sf data frame, which is whether or not the station had \(>\) 40mm rainfall, and use spplot() to make a quick figure.

swiss.sf$ppt40 <- swiss.sf$ppt > 40

plot(swiss.sf["ppt40"], 
     pch = 16, 
       main = "PPT > 40mm")

Now we proceed as in the previous lab:

  1. Create the sample variogram
ppt40.var <- variogram(ppt40 ~ 1, swiss.sf)

plot(ppt40.var)

ppt40.vgm <- vgm(0.035, "Sph", 40000, 0.01)

plot(ppt40.var, ppt40.vgm)

  1. Fit a variogram model
ppt40.vgm2 <- fit.variogram(ppt40.var, ppt40.vgm)

plot(ppt40.var, ppt40.vgm2)

  1. Interpolate using ordinary kriging:
ppt40.ik <- krige(ppt40 ~ 1, 
                  swiss.sf, 
                  swiss.dem, 
                  ppt40.vgm2, 
                  nmax = 40)
## [using ordinary kriging]
  1. Plot the new predictions:
plot(ppt40.ik["var1.pred"])

Note that these are not true probabilities (some values \(<0\) are obtained). For the purposes of geostatistical interpolation, however, these are considered as close to being probabilities and are often corrected to between 0 and 1, which we’ll do next. Indicator simulation (see below) offers a method to interpolate true probabilities.

First, we use the which() function to find all pixels with a value below zero and reset this to zero. We then plot using one of the viridis color palettes (you will need to make sure this is installed):

ppt40.ik$var1.pred[which(ppt40.ik$var1.pred < 0)] <- 0

my.pal <- rev(viridis::magma(10))

plot(ppt40.ik["var1.pred"], 
     col = my.pal, 
     breaks = seq(0, 1, length.out = 11),
     main = "P(ppt > 40mm)")

Categorical variables

The Meuse data set contains soil type for each of the sampling sites, split into three classes. Use the spplot() function to look at their spatial distribution:

plot(meuse["soil"], pch = 16, reset = FALSE)
plot(meuse.riv, add = TRUE, col = NA)

The individual categories can be interpolated to new locations using indicator kriging. As before we start by making and modeling the variogram, then interpolate onto a grid.

Star by making and modeling the sample variogram. Note that rather than creating a new variable, we use the I() function, which tells R to create a new variable internally in a model, here a binary value where soil class 1 equals 1, and other classes equal zero:

s1.var <- variogram(I(soil == 1) ~ 1, meuse, cutoff = 2000)

s1.vgm <- vgm(psill = 0.25, model = "Sph", range = 900, nugget = 0.1)

s1.vgm <- fit.variogram(s1.var, s1.vgm)

plot(s1.var, s1.vgm, main = "Soil class 1")

We now interpolate using the krige() function:

s1.ik <- krige(I(soil == 1) ~ 1, meuse, meuse.grid, s1.vgm)
## [using ordinary kriging]
my.pal <- brewer.pal(9, "Greens")

plot(s1.ik["var1.pred"], 
     col = my.pal, 
     main = "P(Soil == 1)", 
     reset = FALSE)

plot(meuse.riv, col = NA, add = TRUE)

Now do the same for the other two soil classes, to get interpolated probabilities for class 2 (s2.ik) and 3 (s3.ik).

s2.var <- variogram(I(soil == 2) ~ 1, meuse, cutoff = 2000)

vgm_model <- vgm(psill = 0.25, model = "Sph", range = 900, nugget = 0.1)

s2.vgm <- fit.variogram(s2.var, model = vgm_model)

plot(s2.var, s2.vgm, main = "Soil class 2")

s2.ik <- krige(I(soil == 2) ~ 1, meuse, meuse.grid, s2.vgm)
## [using ordinary kriging]
s3.var <- variogram(I(soil == 3) ~ 1, 
                    meuse, 
                    cutoff = 2000)

s3.vgm <- fit.variogram(s3.var, model = vgm_model)

plot(s3.var, s3.vgm, main = "Soil class 3")

s3.ik <- krige(I(soil == 3) ~ 1, meuse, meuse.grid, s3.vgm)
## [using ordinary kriging]

Once you have the probabilities for all three classes, we can use these to estimate the most probable class at each new location. We do this in three steps: first, combine the individual probability interpolations into a single matrix; second, find for each row, the column with the highest probability using max.col() and assign the output as a new variable in meuse.grid; third, plot the results.

soil.prob <- cbind(c(s1.ik$var1.pred), 
                   c(s2.ik$var1.pred), 
                   c(s3.ik$var1.pred))

meuse.grid$soil.pred <- max.col(soil.prob)

my.pal <- brewer.pal(3, "Set2")

plot(meuse.grid["soil.pred"], col = my.pal)

This can be extended to larger numbers of classes, as long as there is sufficient data to produce a variogram for each one, and that you have the patience to model them all.

Geostatistical simulation

All geostatistical simulation methods are designed to produce random spatial fields, where the value at each location is produced by random draws from a probability distribution function defined by the observations. In contrast to straightforward generation of random values, spatially random fields produce random values at each location, but while preserving spatial structure. Individual simulations are much less smooth than kriging interpolation, as the values at any two neighboring locations are randomly chosen, but are spatially correlated as described by a variogram.

Geostatistical simulations come in two forms: constrained and unconstrained. In the unconstrained type, the random field is based on a specified mean and variance, and the variogram for spatial structure. The random fields produced have the same statistical and spatial characteristics, but the minima and maxima may occur anywhere in the study area. Constrained simulations also include the location and value of the observed points. This ensures that minima and maxima occur where they are defined by the original points, and the resulting fields have the same pattern as the original data. We will concentrate here on the constrained type of simulation.

Gaussian simulation

Like ordinary kriging, Gaussian simulation can be performed with continuous variables, and uses a similar setup to the kriging carried out above and in previous labs. We’ll use this with the Swiss precipitation data, so first make a variogram and fit a variogram model.

modNugget <- 0.05
modRange <- 100000
modSill <- 0.75

ppt.var <- variogram(lppt ~ 1, swiss.sf)

ppt.vgm1 <- vgm(psill = modSill, 
                "Sph", 
                range = modRange, 
                nugget = modNugget)

ppt.vgm2 <- fit.variogram(ppt.var, ppt.vgm1)

plot(ppt.var, ppt.vgm2, main = "Swiss precip. variogram")

To carry out 6 random simulations of the Swiss precipitation data, we use the krige() function again, with the spatial data, output grid, variogram, etc. The new parameter used here is nsim which controls the number of output simulations:

ppt.pred.sgs <- krige(lppt ~ 1, 
                      swiss.sf, 
                      swiss.dem, 
                      ppt.vgm2, 
                      nmax = 40, 
                      nsim = 6)
## drawing 6 GLS realisations of beta...
## [using conditional Gaussian simulation]

The spplot() function is very useful for visualizing the output as it plots a grid of all required simulations:

my.pal <- brewer.pal(9, "Blues")

plot(ppt.pred.sgs, 
     col = my.pal, 
     main = "Swiss ppt (SGS)")

In general, single simulations are only of interest to examine the degree of variation between observations (as opposed to kriging which provides smoothed interpolations). The power of the simulation approach is in producing a large number of possible realizations of a spatial field. This allows a better assessment of uncertainty at any location, as we can obtain not just the mean estimated value and a confidence interval, but the full probability distribution of interpolated values.

In the next bit of code, we will produce one hundred simulations of precipitation, then extract the predicted values for a single point and make this into a histogram. So first, re-run the krige function with a higher number of simulations:

ppt.pred.sgs <- krige(lppt ~ 1, 
                      swiss.sf, 
                      swiss.dem, 
                      ppt.vgm2, 
                      nmax = 40, 
                      nsim = 100)
## drawing 100 GLS realisations of beta...
## [using conditional Gaussian simulation]

Next, we create a new point location to extract the simulated values.

newloc <- st_geometry(st_point(c(2650000, 1200000)))

st_crs(newloc) <- st_crs(swiss.dem)

plot(swiss.dem, reset = FALSE, axes = TRUE)

plot(newloc, pch = "x", cex = 3, col = 2, add = TRUE)

We use the function st_extract() for extracting points, and then plot the resulting values as a histogram:

newloc.ppt <- st_extract(ppt.pred.sgs, newloc)

newloc.ppt$ppt <- exp(newloc.ppt$var1)

hist(newloc.ppt$ppt, 
     breaks = 20, 
     col = "darkorange", 
     main = "Precipitation at (2650000, 1200000)", 
     xlab = "mm")

Note that as we are using the same formula notation as kriging, we could easily extend the basic simulation approach to include covariates to represent external drifts or trends.

Indicator simulation

The simulation approach can also be used for indicator interpolation. As in the previous example, we simply reuse the krige() function. In addition to the nsim parameter, we include a parameter indicators=TRUE to perform indicator kriging. In this case, rather than each simulation estimating a probability at each new location, the method estimates a binary value (presence or absence) by drawing from a binomial distribution. To simulate 6 realizations of the distribution of soil class 1:

s1.sis <- krige(I(soil == 1) ~ 1, 
                meuse, 
                meuse.grid, 
                s1.vgm, 
                nsim = 6, 
                indicators = TRUE, 
                nmax = 40)
## drawing 6 GLS realisations of beta...
## [using conditional indicator simulation]
plot(s1.sis)

If we now run multiple simulations, we can assess uncertainty. In this case, we get 1000 simulations of the presence of soil type 1 at each location. To estimate probability, we then take the sum of all presences (1) and divide by the number of simulations (1000). This is stored in the output, and then can be plotted using spplot().

s1.sim <- krige(I(soil == 1) ~ 1, 
                meuse, 
                meuse.grid, 
                s1.vgm,
                nsim = 1000, 
                indicators = TRUE, 
                nmax = 40)
## drawing 1000 GLS realisations of beta...
## [using conditional indicator simulation]
s1.prob <- st_apply(s1.sim, c(1,2), sum) / 1000

plot(s1.prob, main = "P(Soil == 1)")

In contrast to the indicator kriging, the output from this function provides true probabilities of presence, constrained to the range [0,1].

Exercise

  1. The compressed file mwozone.zip contains measurements of ground ozone concentration in the midwestern US for June 20, 1987 in the shapefile ozone.shp. The goal of this exercise is use this data to produce a map showing the probability that a certain threshold (100 ppb) was passed on that date. You can do this using either indicator kriging or indicator simulation, using the examples provided in the lab. You are free to approach this in your own way, but you will need to provide at least the following:
  • A map of ozone concentrations at the measurement stations
  • A plot of the sample variogram describing the spatial structure of the condition you want to interpolate (ozone \(>\) 100 ppb), and a brief description of the spatial structure you observe
  • A variogram model fitted to this (as a figure), and give the model chosen, the sill, range and nugget value (if appropriate)
  • A map of predicted probabilities of passing 100 ppb ozone. In order to do this, you will need to create a longitude/latitude grid for interpolation. See code below for how to do this simply in R
  • A brief description of the spatial pattern shown on the interpolated map — what do you think the higher probabilities relate to?

Code to add shapefiles

It may help to add other shapefiles to your maps for this exercise. The following code assumes that you have read in the ozone shapefile to a sf object called ozone.sf, and adds state outlines, lakes and cities using the Natural Earth datasets. The example here plots out the ozone concentrations, but you can adapt this pretty easily for the predicted values you obtain. Note that to run this code, you will need to make sure the tmap package is installed and download and unzip the following files from Canvas:

  • ne_50m_admin_1_states_provinces.zip
  • ne_50m_lakes.zip
  • ne_50m_populated_places.zip

To overlay multiple shapefiles, we can make use of the reset and add arguments when plotting sf objects. Start by reading in the three extra layers:

ozone.sf <- st_read("../datafiles/mwozone/ozone.shp")
st_crs(ozone.sf) <- 4326
## Read data
states <- st_read("../datafiles/ne_50m_admin_1_states_provinces/ne_50m_admin_1_states_provinces_shp.shp")
lakes <- st_read("../datafiles/ne_50m_lakes/ne_50m_lakes.shp")
places <- st_read("../datafiles/ne_50m_populated_places/ne_50m_populated_places.shp")

Now we plot the layers. Each time we add the argument reset = TRUE to tell R to keep the same basic plot outline. For the second (and subsequent plots) we also set add = TRUE to prevent R from resetting the plot.

plot(st_geometry(ozone.sf), reset = FALSE)
plot(st_geometry(lakes), reset = FALSE, add = TRUE, col = "lightblue")
plot(st_geometry(states), reset = FALSE, add = TRUE)
plot(ozone.sf["ozone"], add = TRUE, pch = 16)

Better plots with multiple layer can be made by using the geom_sf function from ggplot2:

places <- cbind(places, st_coordinates(st_centroid(places)))

ggplot() +
  geom_sf(data = lakes, fill = "lightblue") +
  geom_sf(data = states, fill = NA) +
  geom_sf(data = ozone.sf, aes(col = ozone), size = 2) +
  scale_color_viridis_c() +
  geom_label(data = places, aes(X, Y, label = NAME), size = 2.5) +
  coord_sf(xlim = c(-94, -82), ylim = c(36, 45), expand = FALSE) +
  theme_bw()

Alternatively you can use the tmap package to make a better plot:

library(tmap)

mybbox <- st_bbox(ozone.sf)
tm_shape(lakes, bbox = mybbox) + 
  tm_fill("lightblue") + 
  tm_shape(states) + 
  tm_borders() +
  tm_shape(ozone.sf) + 
  tm_symbols(col = "ozone", size = 0.5, palette = "viridis") +
  tm_shape(places) + 
  tm_text("NAME", size = 0.75)

Code to make a prediction grid

pred.grid <- st_as_stars(mybbox, 
                         xlim = c(-94, -82), 
                         ylim = c(36, 45), 
                         dx = 0.1, 
                         dy = 0.1)

st_crs(pred.grid) <- 4326